Audio Visual Interactions in Multimodal Communications

Abstract

Multimodal signal processing is more than simply "putting together" text, audio, images and video; it is the integration and interaction among these different media that creates new systems and new research challenges and opportunities.  Unimodal analysis of signals can deliver acceptable performance only in benign situations; performance decreases rapidly when countermeasures are taken.  For example, person authentication systems used in security, access control and surveillance applications do not perform well when subjects age, when the video resolution is inadequate, or when lighting conditions are poor.  Many of these difficulties can be overcome by adding an audio signature along with the video.

In multimodal communications where human speech is involved, audio-visual interaction is particularly significant.  Human perception of speech is bimodal in that the perception of acoustic speech can be affected by visual cues from lip movement.  Because of this bimodality, audio-visual interaction is an important design factor for multimodal communication systems such as video telephony and video conferencing.  A prime example of this interaction is lip reading, or speech reading.  It is used by the hearing-impaired to enhance their understanding of speech, but also to some extent by every person with normal hearing, particularly in noisy environments.

One key issue in bimodal speech analysis and synthesis is the establishment of the mapping between acoustic and visual parameters.  A novel approach for establishing this mapping was developed during our previous funding period.  Our current work addresses two interrelated problems.  First, we are considering the synthesis of articulatory parameters for an MPEG-4 facial animation model.  Second, we are concerned with the task of robust speech recognition.  Fusing these two areas will impact the fields of very low bit-rate coding of speech and images, speech- and text-driven facial animation parameters, speech- and text-driven facial animation of synthetic actors (i.e., avatars) and audio-visual speech recognition.

Publications

  1. J. J. Williams, J. C. Rutledge, D. C. Garstecki, and A. K. Katsaggelos, "Frame Rate and Viseme Analysis for Multimedia Applications," Proc. IEEE First Workshop on Multimedia Signal Processing, pp. 13-18, Princeton, NJ, June 23-25, 1997.
  2. J. J. Williams, J. C. Rutledge, D. C. Garstecki, and A. K. Katsaggelos, "Frame Rate and Viseme Analysis for Multimedia Applications," Journal of VLSI Signal Processing Systems, vol. 23, nos. 1/2, pp. 7-23, Oct. 1998.
  3. J. J. Williams, A. K. Katsaggelos, and M. A. Randolph, "A Hidden Markov Model Based Visual Speech Synthesizer," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 5-9, 2000.
  4. P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, "Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features," EURASIP Journal on Applied Signal Processing, in submission.

Theses

  1. J. J. Williams, "Speech-to-Video Conversion for Individuals with Impaired Hearing," Ph.D. Thesis, Department of Electrical and Computer Engineering, Northwestern University, June 2000.
